LibriMix: An Open-Source Dataset for Generalizable Speech Separation
In recent years, wsj0-2mix has become the reference dataset for
single-channel speech separation. Most deep learning-based speech separation
models today are benchmarked on it. However, recent studies have shown
significant performance drops when models trained on wsj0-2mix are evaluated on
other, similar datasets. To address this generalization issue, we created
LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension,
WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker
mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we
achieve competitive performance on all LibriMix versions. In order to fairly
evaluate across datasets, we introduce a third test set based on VCTK for
speech and WHAM! for noise. Our experiments show that the generalization error
is smaller for models trained with LibriMix than with WHAM!, in both clean and
noisy conditions. Aiming towards evaluation in more realistic,
conversation-like scenarios, we also release a sparsely overlapping version of
LibriMix's test set.
Comment: submitted to INTERSPEECH 202
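As a rough, hedged illustration of the recipe behind such mixtures (two LibriSpeech utterances scaled and summed, with an optional WHAM! noise sample), the sketch below follows the general idea only; the file paths, gains, and truncation policy are assumptions rather than the official LibriMix generation scripts.

```python
# Minimal mixing sketch: rescale two LibriSpeech utterances, truncate to the
# shortest one, and optionally add a WHAM! noise sample. Paths and gain values
# are hypothetical; the official LibriMix scripts use loudness-based scaling.
import numpy as np
import soundfile as sf

def make_mixture(speech_paths, noise_path=None, speech_gains_db=(0.0, -2.0), noise_gain_db=-5.0):
    sources, sample_rate = [], None
    for path, gain_db in zip(speech_paths, speech_gains_db):
        audio, sample_rate = sf.read(path)
        sources.append(audio * 10 ** (gain_db / 20))
    length = min(len(s) for s in sources)          # "min" mode: keep the fully overlapped part
    sources = [s[:length] for s in sources]
    mixture = np.sum(sources, axis=0)
    if noise_path is not None:
        noise, _ = sf.read(noise_path)             # assumes the noise is at least as long
        mixture = mixture + 10 ** (noise_gain_db / 20) * noise[:length]
    return mixture, sources, sample_rate

mixture, sources, sr = make_mixture(
    ["LibriSpeech/train-clean-100/spk1/utt1.flac",  # hypothetical file names
     "LibriSpeech/train-clean-100/spk2/utt2.flac"],
    noise_path="wham_noise/tr/noise_example.wav",
)
sf.write("mixture.wav", mixture, sr)
```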
Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection
In many speech-enabled human-machine interaction scenarios, user speech can
overlap with the device playback audio. In these instances, the performance of
tasks such as keyword-spotting (KWS) and device-directed speech detection (DDD)
can degrade significantly. To address this problem, we propose an implicit
acoustic echo cancellation (iAEC) framework where a neural network is trained
to exploit the additional information from a reference microphone channel to
learn to ignore the interfering signal and improve detection performance. We
study this framework for the tasks of KWS and DDD on, respectively, an
augmented version of Google Speech Commands v2 and a real-world Alexa device
dataset. Notably, we show a 56% reduction in false-reject rate for the DDD task
during device playback conditions. We also show comparable or superior
performance to a strong end-to-end neural echo cancellation + KWS baseline
for the KWS task, at an order of magnitude lower computational cost.
Comment: Submitted to INTERSPEECH 202
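To make the idea of implicit echo cancellation concrete, here is a minimal sketch of a detector that consumes the microphone and playback-reference channels jointly, so the network can learn to ignore the interfering playback; the architecture, feature front-end, and class count are assumptions, not the authors' model.

```python
# Illustrative sketch only: a small keyword-spotting classifier that takes the
# microphone signal and the device-playback reference as a two-channel input,
# letting the network suppress the interfering playback implicitly (no explicit AEC).
import torch
import torch.nn as nn

class TwoChannelKWS(nn.Module):
    def __init__(self, n_mels=40, n_classes=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1),   # 2 input channels: mic + reference
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, mic_feats, ref_feats):
        # mic_feats, ref_feats: (batch, n_mels, frames) log-mel spectrograms
        x = torch.stack([mic_feats, ref_feats], dim=1)    # (batch, 2, n_mels, frames)
        h = self.encoder(x).flatten(1)
        return self.classifier(h)

model = TwoChannelKWS()
logits = model(torch.randn(8, 40, 101), torch.randn(8, 40, 101))
```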
A Time-Frequency Generative Adversarial based method for Audio Packet Loss Concealment
Packet loss is a major cause of voice quality degradation in VoIP
transmissions with serious impact on intelligibility and user experience. This
paper describes a system based on a generative adversarial approach, which aims
to repair the lost fragments during the transmission of audio streams. Inspired
by the powerful image-to-image translation capability of Generative Adversarial
Networks (GANs), we propose bin2bin, an improved pix2pix framework that translates
magnitude spectrograms of audio frames with lost packets into non-corrupted speech
spectrograms. To better preserve structural information after spectrogram translation,
we combine two STFT-based loss functions with the traditional GAN objective.
Furthermore, we employ a modified PatchGAN structure as the discriminator and reduce
the concealment time through a proper initialization of the phase reconstruction
algorithm. Experimental results show that the proposed method has clear advantages
over current state-of-the-art methods, as it better handles both high packet loss
rates and large gaps.
Comment: Accepted at EUSIPCO - 31st European Signal Processing Conference, 202
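As a hedged sketch of how STFT-based losses can be mixed with a GAN objective for spectrogram translation, the snippet below combines spectral convergence and log-magnitude L1 terms (a common choice) with a non-saturating adversarial loss; the exact terms and weights used by bin2bin are assumptions here.

```python
# Sketch of a combined objective: two STFT-magnitude losses plus an adversarial term.
import torch
import torch.nn.functional as F

def spectral_convergence(pred_mag, target_mag):
    # Relative Frobenius-norm error between predicted and target magnitudes.
    return torch.norm(target_mag - pred_mag, p="fro") / torch.norm(target_mag, p="fro")

def log_magnitude_l1(pred_mag, target_mag, eps=1e-7):
    return F.l1_loss(torch.log(pred_mag + eps), torch.log(target_mag + eps))

def generator_loss(pred_mag, target_mag, disc_fake_logits,
                   lambda_sc=1.0, lambda_mag=1.0, lambda_adv=1.0):
    # Non-saturating GAN term: the generator tries to make the discriminator
    # score its reconstructed spectrogram patches as real.
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return (lambda_sc * spectral_convergence(pred_mag, target_mag)
            + lambda_mag * log_magnitude_l1(pred_mag, target_mag)
            + lambda_adv * adv)

# Example with dummy tensors (batch of 4 magnitude spectrograms and patch logits).
loss = generator_loss(torch.rand(4, 257, 100) + 1e-3,
                      torch.rand(4, 257, 100) + 1e-3,
                      torch.randn(4, 1, 16, 12))
```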
Learning to Rank Microphones for Distant Speech Recognition
Fully exploiting ad-hoc microphone networks for distant speech recognition is
still an open issue. Empirical evidence shows that being able to select the
best microphone leads to significant improvements in recognition without any
additional effort on front-end processing. Current channel selection techniques
rely on signal-, decoder-, or posterior-based features. Signal-based
features are inexpensive to compute but do not always correlate with
recognition performance. In contrast, decoder- and posterior-based features exhibit
better correlation but require substantial computational resources. In this
work, we tackle the channel selection problem by proposing MicRank, a learning
to rank framework where a neural network is trained to rank the available
channels directly using the recognition performance on the training set. The
proposed approach is agnostic with respect to the array geometry and type of
recognition back-end. We investigate different learning to rank strategies
using a purpose-built synthetic dataset and the CHiME-6 data. Results
show that the proposed approach considerably improves over previous
selection techniques, reaching performance comparable to, and in some instances
better than, oracle signal-based measures.
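A minimal, hedged sketch of the listwise learning-to-rank idea: a scorer assigns a scalar to each channel and is trained so that its score distribution matches a target distribution derived from per-channel recognition performance (lower WER, higher rank). The scorer architecture and target definition below are illustrative assumptions, not MicRank's exact formulation.

```python
# Listwise (ListNet-style) ranking sketch over the channels of one utterance.
import torch
import torch.nn as nn
import torch.nn.functional as F

scorer = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

def listwise_loss(channel_feats, channel_wer):
    # channel_feats: (n_channels, 64) per-channel features (hypothetical dimensionality)
    # channel_wer:   (n_channels,) word error rate of each channel on this utterance
    scores = scorer(channel_feats).squeeze(-1)        # (n_channels,)
    target = F.softmax(-channel_wer, dim=0)           # better channels get more probability mass
    return F.kl_div(F.log_softmax(scores, dim=0), target, reduction="sum")

loss = listwise_loss(torch.randn(6, 64),
                     torch.tensor([0.30, 0.12, 0.45, 0.20, 0.18, 0.60]))
loss.backward()
```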
Filterbank design for end-to-end speech separation
Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corresponding representations and masking strategies. We evaluate these filterbanks on a newly released noisy speech separation dataset (WHAM!). The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet. Also, we validate the use of parameterized filterbanks and show that complex-valued representations and masks are beneficial in all conditions. Finally, we show that the STFT achieves its best performance for 2 ms windows.
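One common way to extend real-valued filters into analytic (complex-valued) ones is via the Hilbert transform, as sketched below; this illustrates the general construction only, and the paper's learned or parameterized filterbanks (e.g., learned center frequencies and bandwidths) are not reproduced.

```python
# Turn real-valued filterbank kernels into analytic (complex-valued) ones by taking
# the Hilbert transform of each filter: the imaginary part is the quadrature component.
import numpy as np
from scipy.signal import hilbert

def make_analytic(real_filters):
    # real_filters: (n_filters, filter_length) real-valued kernels
    analytic = hilbert(real_filters, axis=-1)     # complex: real part + j * Hilbert transform
    return analytic.real, analytic.imag           # cosine/sine filter pairs

real_fb = np.random.randn(512, 32)                # e.g. 512 filters of 32 samples (2 ms at 16 kHz)
cos_fb, sin_fb = make_analytic(real_fb)
# The complex representation of a frame x is then (x @ cos_fb.T) + j (x @ sin_fb.T),
# from which magnitudes, phases, or complex masks can be derived.
```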
Performance above all? Energy consumption vs. performance for machine listening, a study on DCASE task 4 baseline
In machine listening there is a tendency to resort to models with a growing number of parameters, raising concerns about their practical viability due to their energy consumption. Reporting the energy consumption of these models could be a first step toward raising awareness of this matter. Yet, estimating the energy consumption across different conditions (hyper-parameters, GPU types, etc.) poses some challenges in terms of biases and fairness of the comparison between different models and works. In this paper we perform an extensive study using the DCASE task 4 baseline system and monitor energy consumption and training time for different GPU types and batch sizes. The goal is to identify which aspects can have an impact on the estimation of the energy consumption and should be normalized for a fair comparison across systems. Additionally, we propose an analysis of the relationship between the energy consumption and the sound event detection performance that calls into question our current way of evaluating systems.
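For reference, energy reporting of the kind discussed above can be instrumented with a tracker such as codecarbon; the snippet below is a minimal sketch, and whether it matches the exact tooling used in the paper is an assumption.

```python
# Minimal sketch of energy/training-time reporting around a training run.
import time
from codecarbon import EmissionsTracker

def train_model():
    time.sleep(1)   # stand-in for the actual training loop (hypothetical)

tracker = EmissionsTracker(project_name="dcase_task4_baseline")
tracker.start()
start = time.time()
train_model()
emissions_kg = tracker.stop()                 # estimated kg CO2-eq for the run
print(f"training time: {time.time() - start:.0f} s, "
      f"estimated emissions: {emissions_kg:.6f} kg CO2-eq")
# Per-component energy figures (kWh) are logged to emissions.csv in the working
# directory, so runs on different GPU types and batch sizes can be compared.
```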
An Experimental Review of Speaker Diarization Methods with Application to Two-Speaker Conversational Telephone Speech Recordings
We performed an experimental review of current diarization systems for the
conversational telephone speech (CTS) domain. In detail, we considered a total
of eight different algorithms belonging to clustering-based, end-to-end neural
diarization (EEND), and speech separation guided diarization (SSGD) paradigms.
We studied the inference-time computational requirements and diarization
accuracy on four CTS datasets with different characteristics and languages. We
found that, among all methods considered, EEND-vector clustering (EEND-VC)
offers the best trade-off in terms of computing requirements and performance.
More generally, EEND models were found to be lighter and faster at
inference than clustering-based methods. However, they also require a
large amount of diarization-oriented annotated data. In particular, EEND-VC
performance in our experiments degraded when the dataset size was reduced,
whereas self-attentive EEND (SA-EEND) was less affected. We also found that
SA-EEND gives less consistent results across the datasets than
EEND-VC, with its performance degrading on long conversations with high speech
sparsity. Clustering-based diarization systems, and in particular VBx, instead
show more consistent performance than SA-EEND but are outperformed by
EEND-VC. The gap with respect to the latter is reduced when overlap-aware
clustering methods are considered. SSGD is the most computationally demanding
method, but it could be convenient if speech recognition has to be performed.
Its performance is close to SA-EEND but degrades significantly when the
training and inference data characteristics are less matched.
Comment: 52 pages, 10 figure
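For context, diarization accuracy in comparisons like this one is typically reported as the diarization error rate (DER); the sketch below computes it with pyannote.metrics on toy segments, and the paper's exact scoring setup (collar, overlap handling) is not reproduced.

```python
# Toy DER computation between a reference and a hypothesis segmentation.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(8.0, 16.0)] = "spk_B"        # 2 s of overlapped speech

hypothesis = Annotation()
hypothesis[Segment(0.0, 9.0)] = "spk_1"
hypothesis[Segment(9.0, 16.0)] = "spk_2"

der = DiarizationErrorRate()(reference, hypothesis)
print(f"DER: {der:.3f}")
```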
Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge
This paper describes our submission to the Second Clarity Enhancement
Challenge (CEC2), which consists of target speech enhancement for hearing-aid
(HA) devices in noisy-reverberant environments with multiple interferers such
as music and competing speakers.
Our approach builds upon the powerful iterative neural/beamforming
enhancement (iNeuBe) framework introduced in our recent work, and this paper
extends it to target speaker extraction. We therefore name the proposed
approach iNeuBe-X, where the X stands for extraction. To address the
challenges encountered in the CEC2 setting, we introduce four major novelties:
(1) we extend the state-of-the-art TF-GridNet model, originally designed for
monaural speaker separation, to multi-channel, causal speech enhancement, and
observe large improvements when replacing the TCNDenseNet used in iNeuBe
with this new architecture;
(2) we leverage a recent dual window size approach with future-frame
prediction to ensure that iNeuBe-X satisfies the 5 ms constraint on algorithmic
latency required by CEC2;
(3) we introduce a novel speaker-conditioning branch for TF-GridNet to
achieve target speaker extraction;
(4) we propose a fine-tuning step, where we compute an additional loss with
respect to the target speaker signal compensated with the listener audiogram.
Without using external data, on the official development set our best model
reaches a hearing-aid speech perception index (HASPI) score of 0.942 and a
scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 18.8 dB.
These results are promising given that the CEC2 data is extremely
challenging (e.g., on the development set the mixture SI-SDR is -12.3 dB). A
demo of our submitted system is available at WAVLab CEC2 demo.
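As a hedged illustration of novelty (3), the speaker-conditioning idea can be realized by modulating intermediate features of the enhancement network with an embedding of the target speaker's enrollment audio (FiLM-style), as sketched below; the actual conditioning branch added to TF-GridNet in iNeuBe-X may be implemented differently.

```python
# Generic FiLM-style speaker conditioning: scale and shift intermediate features
# using a target-speaker embedding, steering the network toward that speaker.
import torch
import torch.nn as nn

class SpeakerFiLM(nn.Module):
    def __init__(self, emb_dim=256, feat_channels=64):
        super().__init__()
        self.to_scale = nn.Linear(emb_dim, feat_channels)
        self.to_shift = nn.Linear(emb_dim, feat_channels)

    def forward(self, feats, spk_emb):
        # feats:   (batch, channels, time, freq) intermediate enhancement features
        # spk_emb: (batch, emb_dim) embedding of the target speaker's enrollment audio
        scale = self.to_scale(spk_emb)[:, :, None, None]
        shift = self.to_shift(spk_emb)[:, :, None, None]
        return feats * (1 + scale) + shift

film = SpeakerFiLM()
out = film(torch.randn(2, 64, 100, 257), torch.randn(2, 256))
```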